Predicting the quality of text-to-speech systems from a large-scale feature set
نویسندگان
چکیده
We extract 1495 speech features from 2 subjectively evaluated text-to-speech (TTS) databases. These features are extracted from pitch, loudness, MFCCs, spectrals, formants, and intensity. The speech material is synthesized using up to 15 different TTS systems, some of them with up to 8 different voices. We develop quality predictors for TTS signals following two different approaches to handle the huge set of speech features: a three-step feature selection followed by a stepwise multiple linear regression and an approach based on support vector machines. The predictors are cross-validated via 3-fold cross validation (CV) and leave-one-test-out (LOTO) CV. Due to the high number of features we apply a strict CVmethod where the partitioning is realized prior to the feature scaling and feature selection steps. In comparison we also follow a semi-strict approach where the partitioning effectively takes place after these steps. In the 3-fold CV case we achieve correlations as high as .75 for strict CV and .89 for semi-strict CV. The more ambitious LOTO CV yields correlations around .80 for the male speakers whereas the results for the female voices show the need for improvement.
منابع مشابه
سیستم شناسایی و طبقه بندی اسامی در متون فارسی
Name entity recognition (NER) is a system that can identify one or more kinds of names in a text and classify them into specified categories. These categories can be name of people, organizations, companies, places (country, city, street, etc.), time related to names (date and time), financial values, percentages, etc. Although during the past decade a lot of researches has been done on NER in ...
متن کاملSpoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB) as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIBchr('39')s archive is one of the richest archives in Iran containing a huge amount of multimedia data. Monitoring this massive volume of data, and brows and retrieval of this archive is one of the key issues for this broadcasting...
متن کاملمراحل و نحوه ی تهیه ی دادگان های صوتی هجایی و دایفونی برای سامانه ی تبدیل متن به گفتار فارسی
Abstract Speech databases are part of the concatenative text to speech synthesis systems. Phonetic quality of the databases plays a significant role in the naturalness of the synthesized speech. This paper introduces two syllable and diphone speech databases for Persian and investigates the way of their development and their specifications and their advantages to each other. ...
متن کاملStudy on Unit-Selection and Statistical Parametric Speech Synthesis Techniques
One of the interesting topics on multimedia domain is concerned with empowering computer in order to speech production. Speech synthesis is granting human abilities to the computer for speech production. Data-based approach and process-based approach are the two main approaches on speech synthesis. Each approach has its varied challenges. Unit-selection speech synthesis and statistical parametr...
متن کاملA New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2013